19. Gini Impurity

So far, you've seen how to use entropy to calculate the information gain of a split. There is an alternative measure of split quality: the Gini index, also known as Gini impurity.

If there are K classes, and \hat{p}_k is the fraction of observations in a node that belong to class k, we can calculate G (the Gini index) for the node:

G = \sum^K_{k=1}\hat{p}_{k}(1-\hat{p}_{k})

The Gini index takes on a small value when all of the proportions \hat{p}_k are close to zero or one. You can think of it as a measure of node purity: if the value is small, the node mostly contains observations from a single class. It turns out that the Gini index and entropy are quite similar numerically.
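
To make the formula concrete, here is a minimal sketch in Python (the helper name gini is ours, not part of any library):

def gini(proportions):
    # Gini index: sum over classes of p_k * (1 - p_k).
    return sum(p * (1 - p) for p in proportions)

# A pure node (all observations from one class) has impurity 0.
print(gini([1.0, 0.0]))   # 0.0
# A 50/50 node is maximally impure for two classes.
print(gini([0.5, 0.5]))   # 0.5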

To measure the quality of a split using the Gini index, calculate the decrease in impurity: take the Gini index of the parent node and subtract the weighted average of the Gini indexes of the child nodes:

G_{\mathrm{decrease}} = G_{\mathrm{parent}} - \sum_{\mathrm{children}} (\mathrm{fraction\;of\;observations})_{\mathrm{child}} \times G_{\mathrm{child}}
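
As an illustration, here is a small Python sketch that computes this decrease for a made-up split (the counts are hypothetical, not the ones from the exercise image, and the helper names are ours):

from collections import Counter

def gini_from_counts(labels):
    # Gini index of a node, from the class labels of its observations.
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())

def gini_decrease(parent, children):
    # Parent impurity minus the size-weighted average child impurity.
    n = len(parent)
    weighted = sum(len(child) / n * gini_from_counts(child) for child in children)
    return gini_from_counts(parent) - weighted

# Hypothetical split of 10 observations into a pure left child
# and a mixed right child.
parent = ["a"] * 6 + ["b"] * 4
left = ["a"] * 4
right = ["a"] * 2 + ["b"] * 4
print(round(gini_decrease(parent, [left, right]), 4))   # 0.2133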

Scikit-learn supports both metrics for evaluating the quality of splits via the criterion hyperparameter: Gini impurity ("gini", the default) and entropy-based information gain ("entropy").
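
For example, you might set the criterion on a DecisionTreeClassifier like this (trained on the built-in iris data purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is the default; criterion="entropy" switches the
# split-quality measure to entropy-based information gain.
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(gini_tree.score(X, y), entropy_tree.score(X, y))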

Gini impurity

What is the decrease in Gini impurity for the split indicated in the image above?

SOLUTION: 0.18